IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



UTILITY PATENT APPLICATION FOR: 

OPTIMIZATION OF CLOCK SCHEDULING FOR A 
SYNCHRONOUS SYSTEM 



Osamu Samuel NAKAGAWA 
224 Bradbury Lane, Redwood City, CA 94061 

Norman CHANG 
35864 Caxton Place, Fremont, CA 94536 

Shen LIN 

637 Matsonia Drive, Foster City, CA 94404 
Weize XIE 

10559 Sterling Boulevard, Cupertino, CA 95014 



HP 10004741-1 



PATENT 



I OPTIMIZATION OF CLOCK SCHEDULING FOR A SYNCHRONOUS SYSTEM 

2 

3 FIELD OF THE INVENTION 

4 

5 This invention relates generally to synchronous digital circuitry such as that used in 

6 computers and digital processing systems, and more particularly to synchronization of clock 

7 or timing signals using delay. 
8 

9 BACKGROUND OF THE INVENTION 

10 

I I Synchronous (i.e. , clocked or pulsed) circuits must operate within timing constraints. 

12 Synchronous logic circuits typically include a clock distribution network for providing a clock 

13 signal to various sub-circuits. A typical clock network includes one or more clock sources 

14 that are coupled to a number of clock "sinks." A clock sink is any circuit or set of circuits 

15 accepting a clock input. In their simplest form, clock sinks are flip-flops and latches. 

16 Examples of more sophisticated sinks, involving several flip-flops or latches in the same time 

17 domain, include registers, counters, and state machines. 
18 

19 Figure 1 illustrates clock distribution throughout parts of a synchronous logic circuit 

20 1 00. The circuit 1 00 may be a VLSI (very large scale integration) integrated circuit, such as a 

21 microprocessor. The circuit 100 comprises a clock 110 and several clock sinks 120 

22 connected to the clock 1 10 by clock lines (also called traces) 130. The circuit 100 also 

23 comprises logic circuits (not shown) interconnecting the sinks 120. The physical path lengths 

24 along the clock lines 130 to each sink 120 are generally different. Because of the non- 
25 uniform lengths of clock lines and other factors such as trace resistance, trace capacitance and 

26 load capacitance, clock pulses generally arrive at each sink 120 at different times. In other 

27 words, the clock signals are generally out of phase when they arrive at the sinks 120. A 

28 measure of this phase mismatch or non-uniformity in clock arrival time is "clock skew." The 

29 skew between a clock signal at two points is the time delay or phase difference between the 

30 clock signals at those two points. In the circuit 100 having several clock sinks 120, maximum 
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1 clock skew between any two sinks 120 is usually specified as some percentage (e.g., 10%) of 

2 the clock period. 
3 

4 Excessive clock skew is undesirable. If clock skew is too great, then the sinks 120 

5 may fail to operate together properly. If a clock signal arrives at a sink 120 too early or too 

6 late relative to other events, then the circuit 100 may experience race conditions. Another 

7 negative consequence is that clock skew can be a limiting factor in how fast the clock 110 can 

8 operate. Misoperation and clock speed limitations result from violations of a setup or hold 

9 time of a sink 120. 
10 

1 1 Figure 2 illustrates setup and hold times for the case where the sink is a latch. Figure 

12 2A shows a dynamic latch 200 accepting an input D and a clock signal CLK. The dynamic 

13 latch 200 produces an output Q, which follows the input D, after a small delay, when the 

14 clock signal CLK is high and remains in the same state when the clock signal CLK is low. 

15 Figure 2B shows waveforms of the clock signal CLK, the input D and the output Q. As 

16 shown in Figure 2B, the input D pulses high briefly. In order for the output Q to follow the 

17 input D, the input pulse must be high for a setup time T s before the clock signal CLK falls 

1 8 low and stay high for a hold time T H after the clock signal CLK falls low. The setup time T s 

19 or the hold time T H may be violated if the clock signal CLK drifts, as can happen when there 

20 is excessive clock skew. 

21 

22 In the prior art, there are several approaches for minimizing clock skew. One 

23 approach is "retiming," typified by U.S . Patent 5,849,610. Retiming is physical placement of 

24 sinks and re-routing of clock lines to equalize path length. However, perfect distance 

25 equalization is seldom possible. Other design considerations often mandate sink and trace 

26 placement. Furthermore, physical distance is only one factor affecting propagation delay. 
27 

28 Another approach is the method of U.S. Patent 6,075,832, which discloses variable 

29 delay elements on clock lines. The delay amounts are controlled by feedback control using a 

30 delay locked loop that dynamically equalizes delay during circuit operation. A disadvantage 
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1 of this method is that it is complex, requiring extra space and power in the circuit. 

2 

3 Yet another approach is "clock scheduling," such as disclosed in its most elementary 

4 form in U.S. Patent 5,758, 1 30. According to that patent, a delay is introduced along a shorter 

5 clock trace so as to synchronize its arrival with that of a longer clock trace. More generally, 

6 clock scheduling involves delaying clock signals before the clock inputs to various sinks, so 

7 that the clock signals arrive at their destination sinks at the same time (zero skew scheduling) 

8 or in some other desired relationship with each other (non-zero clock skew scheduling). Ivan 

9 S . Kourtney and Eby G. Friedman, "Timing Optimization through Clock Skew Scheduling," 

10 Kluwer Academic Publishers, 2000 (ISBN 0-7923-7796-6), which is hereby incorporated by 

1 1 reference, discusses non-zero clock scheduling problem in great detail and discloses solutions 

12 based on linear programming and quadratic programming. Unfortunately, both techniques, 

13 which are entirely deterministic, do not always converge to a solution to this problem. 
14 

15 SUMMARY OF THE INVENTION 
16 

17 In one respect, the invention is a method for determining a plurality of clock delay 

1 8 values. Each delay value is associated with a delay element on a clock line leading to a clock 

19 sink in a synchronous circuit. The method determines an initial set of delay values and 

20 executes an optimization algorithm, beginning with the initial set of delay values, to arrive at 

21 a set of delay values that at least approximately meet a criteria while satisfying timing 

22 constraints associated with selected pairs of logically connected clock sinks. The 

23 optimization algorithm randomly modifies the set of delay values. In one embodiment, the 

24 timing constraints are defined in terms of setup and hold times. In preferred forms, the 

25 optimization algorithm is a genetic algorithm or a gradient descent algorithm. 
26 

27 In another respect, the invention is an synchronous logic circuit having delay elements 

28 determined by the above method. 
29 

HP 10004741-1 

3 



1 In yet another respect, the invention is a computer readable medium on which is 

2 embedded computer software that performs the above method. 
3 

4 In comparison to known prior art, certain embodiments of the invention are capable of 



5 achieving certain advantages, including some or all of the following: (1) the algorithms are 

6 capable of converging when other solutions do not; (2) the algorithms are capable of 

7 converging to a better solution than prior art solutions; (3) the algorithms converge more 

8 quickly than other solutions; and (4) the algorithms are adaptable to variations from device to 

9 devices. Those skilled in the art will appreciate these and other advantages and benefits of 
10 various embodiments of the invention upon reading the following detailed description of a 



1 1 preferred embodiment with reference to the below- listed drawings. 
12 

1 3 BRIEF DESCRIPTION OF THE DRAWINGS 
14 

15 Figure 1 is a diagram illustrating clock distribution throughout parts of a synchronous 

16 logic circuit; 

17 Figure 2 illustrates setup and hold times during a pulse of a clock signal; 

18 Figure 3 is a block diagram illustrating clock distribution throughout parts of a 

19 synchronous logic circuit, according to an embodiment of the invention; 

20 Figure 4 is a diagram illustrating a configuration of tuned delay elements used for 

2 1 clock distribution along a local datapath, according to one embodiment of the invention; and 

22 Figures 5-7 are flowcharts of timing optimization algorithms, according to 

23 embodiments of the invention. 

24 

25 DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT 

26 

27 Figure 3 is a block diagram illustrating clock distribution throughout parts of a 

28 synchronous logic circuit 300, according to an embodiment of the invention. Like the circuit 

29 1 00, the circuit 300 comprises a clock 110 and several clock sinks 1 20 connected to the clock 

30 1 10 by clock lines 130. Unlike the circuit 100, the circuit 300 also comprises delay elements 

HP 10004741-1 

4 



1 340 connected before each sink 1 20. Although Figure 3 shows one delay element 340 for 

2 each sink 1 20, a delay element 340 can feed multiple sinks, or some sinks may not have delay 

3 elements on their clock inputs. The delay elements 340 can be buffers, series circuits of an 

4 even number of inverters or any other means for delaying a signal known to those skilled in 

5 the art. The delay elements 340 are adjustable either dynamically during operation of the 

6 circuit 300 or statically during manufacture or initial configuration of the circuit 300. 

7 Preferably, a delay element 340 is constructed as a series of even number of inverts, every 

8 second output of which is input to a multiplexer, whose select lines are used to control which 

9 output and thus how much delay is provided. Alternate structures for the delay elements 340 

10 include delay locked loops and a series of inverters having controllable power supply 

1 1 voltages. Regardless of the structure of the delay elements 340, techniques for selecting the 

12 set of delay values for the delay elements 340 is described in detail below. 
13 

14 Figure 3 also illustrates a logic circuit 350 interconnecting some of the sinks 120. 

15 Many other logic circuits are likely present in the circuit 300, but only one is illustrated for 

16 clarity. The logic circuit 350 operates on data flowing from sink(s) to sink(s). Two sinks 120 

17 and the logic circuit 350 interconnecting them form a local datapath. The circuit 300, if 

18 typical of most logic systems, contains a large number of local datapaths. 
19 

20 Figure 4 illustrates a local datapath in greater detail, for the case where the sinks are 

21 latches. The local datapath begins at a latch 400 i5 where a signal Di is input and a signal Qi is 

22 output. The signal Qi is input to the logic circuit 350, which outputs a signal Dj. A latch 400j 

23 accepts as input the signal Dj and outputs a signal Qj. A clock signal CLK is also supplied to 

24 the latches 400i and 400j . The clock signal CLK going to the latch 400; has a skew Q relative 

25 to some reference. The clock signal CLK going to the latch 400j has a skew Q relative to the 

26 same reference. Delay elements 340i and 340j introduce additional delay in the clock signal 

27 CLK before reaching the latches 400i and 400j, respectively. The delay values for the delay 

28 elements 340i and 340j are chosen to satisfy the setup and hold time constraints of the latches 

29 400j and 400j, as will be explained next. 
30 
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1 Assume multiple paths traverse across the logic circuit 350. Denote the longest 

2 propagation delay of these paths as D^l, and the shortest as Dp. Then the setup and hold time 

3 constraints are given by the following inequalities: 

4 T Si + Q + Q + D ijL < T CLK+ + Cj + Gj (1) 

5 Q + Q + D ijS > Cj + Gj + T Hj (2) 



6 where Tclk+ is the nominal duration of a positive pulse on the clock signal CLK, assuming 

7 that the latches activate upon a positive clock pulse, and Tsi and T H j are the setup and hold 

8 times for the latches 400i and 400j, respectively. As a simplification, which is often true, the 

9 setup times and hold times for each latch may be assumed to be the same, in which case the i 
10 and j subscripts can be dropped from T Si and T H j. 

11 



12 As an alternative, the constraint inequalities (1) and (2) can be modified to include a 

13 safety margin, SM, as follows: 

14 T S i + Ci + Gi + D ijL +SM<TcLK + + C j + Gj (3) 

15 Ci + Q + D ijS - SM > Cj + Gj + T Hj (4) 

16 Including SM can result in a solution that is less sensitive to operating variations, such as 

17 might arise from voltage or temperature variations. However, including SM can also 

18 constrain the solution space such that a less optimal solution results. 
19 

20 The inequalities (l)-(4) are illustrative, and not limiting, examples of timing 

21 constraints between two sinks connected by logic. The inequalities (1 )-(4) are appropriate for 

22 the latches 400i and 400j, which are edge-triggered as shown in Figure 2B. Similar 

23 inequalities can be stated for latches triggered on the opposite clock edge, or for level- 

24 triggered sinks. Furthermore, other timing constraints involving other device and signal 

25 parameters are possible. 



26 

27 For a circuit containing N sinks, there are between order(N) and order(N 2 ) possible 

28 local datapaths. For each datapath there is a timing constraint, such as a pair of inequalities in 

29 the same form as (1) and (2) (or (3) and (4)), that must be satisfied. A solution to this set of 

30 inequalities is any set of delay values {Gk:l< k < N}satisfying all the inequalities. The 
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1 solution space may be further constrained by the realizable limits of the delay values (e.g., - 

2 1 28 ps (picoseconds) < G k < 1 28 ps and/or only discrete values of G k possible V k). Although 

3 there are potentially many solutions to this set of inequalities, a solution that is optimum or 

4 nearly optimum in some sense is preferred. There are many possible criteria that define the 

5 optimum, including, by way of illustration and not limitation, the following: 

6 min(T C LK + ) (5) 

7 min(£G k )ormax(£G k ) (6) 

8 minE(T skew (i,j) - Gij) 2 ] (7) 

9 In each case the criteria is minimization (or maximization) of some quantity, which is termed 

10 the "objective function." The criteria in expression (5) is minimization of the clock signal's 

1 1 positive pulse width and, indirectly, the clock period (i .e. , maximization of clock frequency). 

12 The objective function in expression (6) the total added delay, summed across all delay 

1 3 elements G k , 1 < k < N. Note that maximization of £G k , though not an intuitive thing to do, is 

14 a valid criteria; note that because clock signals are periodic, a large delay is functionally 

15 equivalent to a smaller delay modulo the clock period. The objective function in expression 

16 (7) is the sum of squared norms of total skew T skew (i j) from some target clock skew, Gy, such 

17 as a value in the middle of a permissible range, for all connected pairs i-j of delay elements. 

18 T skew (i j) may be defined as - Gj or (Q+GO - (Cj+Gi). 
19 

20 Given some optimization criteria, such as those in expressions (5)-(7), the problem is 

21 to select delay values G k , 1< k < N, within the solution space such that the optimization 

22 criteria is met or approximately met. In other words, the problem is to determine the 

23 optimum or approximately optimum solution, as defined by the given optimization criteria, 

24 subject to the setup and hold time constraints (which define the solution space). Methods for 

25 solving this problem are illustrated in Figures 5-7. 
26 

27 Figure 5 is a flowchart of a generalized timing optimization method 500. The method 

28 500 comprises two steps: determining (510) an initial set of delay values and performing 

29 (520) an optimization algorithm. The initial set of delay values may or may not be a solution, 

30 i.e., they may or may not satisfy the constraints, such as, for example, the setup and hold time 
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1 inequalities (1 ) and (2). The optimization algorithm, which is preferably iterative, refines the 

2 initial set of delay values so as to converge toward a solution that at least approximately 

3 meets a given criteria. In other words, the optimization algorithm searches the space of 

4 potential solutions to find a solution that is optimum or nearly optimum according to the 

5 given criteria. The optimization algorithm operates with some randomness, rather than being 

6 purely deterministic. As described below, the optimization algorithm is preferably a genetic 

7 algorithm or a gradient descent algorithm; however, other optimization algorithms involving a 

8 random component, such as simulated annealing or stochastic programming, are equally 

9 possible. 
10 

11 In one form, the optimization algorithm is a genetic algorithm. Figure 6 is a flow 

12 chart of a timing optimization method 600 based on a genetic algorithm. The method 600 

13 begins by determining (610) a set of potential solutions. Each potential solution is an N-tuple 

14 of delay values G k , 1 < k < N. The initial potential solutions are preferably random or 

15 determined by a linear programming or quadratic programming algorithm, which are well- 

16 known in the art but do not typically converge to the correct solution. 

17 

18 Next, the method 600 selects (620) parent potential solutions. Preferably, the 

19 selecting step 620 is performed by conducting a tournament. In a preferred tournament, the 

20 parents are selected randomly with probability proportional to their fitness values, which are 

21 measures of how well the solutions meet the optimization criteria. For example, if the 

22 optimization criteria is defined by minQTGk), then potential solutions in the population are 

23 assigned a selection probability inversely related to the quantity £Gk for each potential 

24 solution. Optionally, the function mapping from an objective function value to a fitness value 

25 could involve scaling. 
26 

27 Next, the method 600 crosses-over (630). The crossing-over step 630 is a breeding or 

28 mating step, to produce children potential solutions. In a preferred form, the crossing-over 

29 step 630 produces two children from two parents. In the case where each delay value G k is a 

30 discrete variable, a potential solution may be represented as a concatenation of the binary 
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1 representations of each delay value: Gi:G2:...:G N . With both parents represented in this 

2 manner, the crossing-over step 630 can be visualized by aligning a copy of one parent's bit 

3 pattern above a copy of the other's. Next, the crossing-over step 630 divides the parents' bit 

4 patterns into a number of regions. The number of regions and their endpoints are preferably 

5 random. Then, the crossing-over step randomly swaps the bits within each region, region by 

6 region independently. The resulting bit patterns are the two children. Each child has parts 

7 (i.e., regions) of each parent (except the very rare case in which no swaps occur). The size, 

8 number and locations of regions is arbitrary. 
9 

10 Next, the method 600 mutates (640) each delay value (Gk) in the children. The 

1 1 preferred mutation is a Gaussian randomized mutation on a sink-by-sink basis. In this case, 

12 the mutation step 640 adds to each delay value Gk in a child solution a Gaussian random 

13 variable. Preferably, the Gaussian random variable is N(0,o) where 3a is approximately one 

14 clock period. In one embodiment, each delay value Gk is discrete and quantized to 6 bits (and 

15 thus having 64 possible values) equally spaced from zero to one clock period, and the 

16 Gaussian random variable is likewise rounded to truncated to the same resolution. 
17 

18 After the mutation step 640, the method 600 checks (650) whether the mutated 

19 potential solution is a better than the previous one (before mutation). The checking step 650 

20 evaluates the relevant objective function for potential solutions before and after mutation. If 

21 the mutation is better, then the method 600 discards (660) the worst solutions from 

22 consideration and iterates the steps 610-650, returning to the determining step 610 to 

23 determine new potential solutions to replace the discarded ones. If the mutated potential 

24 solution is not better, then the method 600 tests (670) whether the goal has been reached or 

25 the maximum number of iterations have been performed. In either case, the method 600 

26 terminates. The goal may be expressed as being within some tolerance of the ideal or a 

27 desired solution. 
28 

29 Determination of which potential solutions are better or more fit preferably involves 

30 both constraint satisfaction and meeting the optimization criteria. A true solution (i.e., 
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1 satisfying the constraints) is better or more fit than a non-solution. Between two solutions, 

2 the one that better meets the optimization criteria is better or more fit than the other. 

3 

4 Genetic algorithms, per se, are known. In the parlance of genetic algorithms, each 

5 solution is a "chromosome," the whole set of potential solutions under consideration is a 

6 "population" or "colony," and each iteration of the steps 610-660 is an "epoch" or 

7 "generation." Those skilled in the art can vary the method 600 within the understanding of 

8 genetic algorithms. For example, the size of the population is largely arbitrary, but a 

9 population that is too small may take too long to converge while a population that is too large 
10 may never drift towards any direction and therefore never converge. 

11 

12 In another form, the optimization algorithm is a gradient descent algorithm. Figure 7 

13 is a flowchart of a timing optimization method 700 based on a gradient descent algorithm. 

14 The method 700 begins by determining (7 1 0) initial delay values. The initial delay values are 

15 preferably all-zero or determined by a linear programming or quadratic programming 

16 algorithm. The method 700 next perturbs (720) the delay values. The perturbations are 

17 preferably random steps that increase, decrease or hold unchanged each delay value. 

18 Preferably, the perturbation is one of three effects: increase by a fixed amount, decrease by 

19 the fixed amount or hold unchanged. In one embodiment, each delay value G k is discrete and 

20 quantized to a number of bits, the fixed amount is a single bit resolution. The method 700 

2 1 next checks (730) whether the perturbed solution is better than the unperturbed solution. The 

22 checking step 730 evaluates the relevant objective function for solutions before and after 

23 perturbation. If the perturbed solution is not better, the method 700 discards it and returns 

24 (740) to the last best solution before repeating the perturbation step 720. This loop continues 

25 until a better solution is found, at which point, the method 700 can be said to be moving in 

26 the right direction (or down the "gradient"). Thereafter, the method 700 continues to adjust 

27 (750) the delay values in the same direction (i.e., for each delay value adjusting the amount by 

28 which that delay value is increased or decreased). After each adjustment, the method 700 

29 tests (760) whether the goal has been reached or the maximum number of iterations have been 

30 performed. In either case, the method 700 terminates. 
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1 

2 Gradient descent algorithms, per se, are known. Those skilled in the art can vary the 

3 method 700 within the understanding of gradient descent algorithms. 
4 

5 The methods 500, 600 and 700 are preferably applied to a synchronous circuit or 

6 system near the final steps in production. After manufacture of a batch of circuits, the circuit 

7 parameters, such as longest and short path delays, setup and hold times, and nominal clock 

8 skews can be measured for each device in the batch. Alternatively, these quantities can be 

9 analytically estimated for the entire batch before. However, manufacturing process variations 

10 can cause these parameters to differ from device to device. After measuring the pertinent 

1 1 parameters for a particular device, the methods 500, 600 or 700 can be used to determine the 

12 set of delay values to be programmed into the delay elements of the device. In this way, the 

1 3 methods 500, 600 and 700 are adaptable. Furthermore, the solutions attained by the methods 

14 500, 600 and 700 are often less sensitive to operating variations, such as voltage and 

1 5 temperature variations , especially when a safety margin is included in the constraint equations 

16 or when the optimization criteria involves target skew values. 
17 

18 The methods 500, 600 and 700 can be applied to an entire clock scheduling problem 

19 wholly. Alternatively, multiple instances of methods 500, 600 and 700 can be run, each 

20 applied to a partition of the overall clock scheduling problem; then, the individual solutions 

21 can be patched together. 

22 

23 The methods 500, 600 and 700 can be performed by computer programs, which can 

24 exist in a variety of forms both active and inactive. For example, they can exist as software 

25 program(s) comprised of program instructions in source code, object code, executable code or 

26 other formats. Any of the above can be embodied on a computer readable medium, which 

27 include storage devices and signals, in compressed or uncompressed form. Exemplary 

28 computer readable storage devices include conventional computer system RAM (random 

29 access memory), ROM (read only memory), EPROM (erasable, programmable ROM), 

30 EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. 

HP 10004741-1 

11 



1 Exemplary computer readable signals, whether modulated using a carrier or not, are signals 

2 that a computer system hosting or running the computer program can be configured to access, 

3 including signals downloaded through the Internet or other networks. Concrete examples of 

4 the foregoing include distribution of the programs on a CD ROM or via Internet download. 

5 In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same 

6 is true of computer networks in general. 
7 

8 What has been described and illustrated herein is a preferred embodiment of the 

9 invention along with some of its variations. The terms, descriptions and figures used herein 

10 are set forth by way of illustration only and are not meant as limitations. Those skilled in the 

11 art will recognize that many variations are possible within the spirit and scope of the 

12 invention, which is intended to be defined by the following claims — and their equivalents — 

13 in which all terms are meant in their broadest reasonable sense unless otherwise indicated. 
14 
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